Conversation


cyb70289 (Contributor) commented May 29, 2025

This PR improves the q4_k_q8_k GEMM kernel with the arm64 i8mm instruction.

Tested on neoverse-n2 with a llama3 8b q4_k_m quantized model.

  • 34% ~ 50% S_PP uplift for all batch sizes
  • 12% ~ 37% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.



```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |   110.12 |   147.83 |    24.36 |    24.28 |
|   128 |    128 |    2 |   121.16 |   172.42 |    46.36 |    47.93 |
|   128 |    128 |    4 |   120.15 |   169.75 |    74.68 |    84.00 |
|   128 |    128 |    8 |   130.97 |   196.81 |    91.04 |   114.74 |
|   128 |    128 |   16 |   131.01 |   196.88 |   101.43 |   135.79 |
|   128 |    128 |   32 |   130.85 |   196.51 |   106.97 |   147.29 |
---------------------------------------------------------------------
```
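
For context, here is a minimal sketch (not the PR's actual kernel; the function name `mmla_2x2` is invented for illustration) of the SMMLA primitive that the arm64 i8mm extension provides via the `vmmlaq_s32` intrinsic: one call accumulates a 2x2 tile of int32 dot products from two 2x8 int8 operand matrices, which is why the kernel wants two rows per call.

```
// Sketch only: the i8mm building block behind this PR's GEMM path.
// vmmlaq_s32 (SMMLA) computes C(2x2) += A(2x8) * B(2x8)^T in int8/int32.
// Build with e.g. -march=armv8.6-a+i8mm so __ARM_FEATURE_MATMUL_INT8 is set.
#include <arm_neon.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_MATMUL_INT8)
static inline void mmla_2x2(int32_t c[4], const int8_t a[16], const int8_t b[16]) {
    int32x4_t acc = vld1q_s32(c);   // accumulators [c00, c01, c10, c11]
    int8x16_t va  = vld1q_s8(a);    // rows a0, a1 (8 int8 each)
    int8x16_t vb  = vld1q_s8(b);    // rows b0, b1 (8 int8 each)
    acc = vmmlaq_s32(acc, va, vb);  // c[i][j] += dot(a_i, b_j)
    vst1q_s32(c, acc);
}
#endif
```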
github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) May 29, 2025
ggerganov merged commit 54a2c7a into ggml-org:master May 29, 2025
46 checks passed
cyb70289 deleted the q4k branch May 29, 2025 12:04
uint32_t utmp[4];

#if defined(__ARM_FEATURE_MATMUL_INT8)
if (nrc == 2) {

@cyb70289: Naive question - if I understand correctly, this is the number of rows, and if it has to be 2 to use SMMLA, how come we see gains with batch size 1 in prompt prefilling?

cyb70289 (Contributor, Author)


Prompt prefill is different from token generation. In PP, all the tokens are processed at once, so the activation shape is [batch_size, prompt_tokens, embedding_size]. So I8MM is always useful for PP even if batch=1 (unless the prompt has only one token). For TG, the activation shape is [batch_size, 1, embedding_size], so I8MM only works for batch > 1.
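
A hypothetical illustration of this point (not llama.cpp code; the helper and numbers are made up): in prefill the activation matrix has one row per prompt token, so rows can be paired for the 2-row i8mm path even at batch size 1, while in token generation each sequence contributes only a single row.

```
// Toy example: how many 2-row i8mm calls are possible per weight tile.
#include <stdio.h>

static int paired_i8mm_calls(int activation_rows) { return activation_rows / 2; }

int main(void) {
    // prefill: rows = number of prompt tokens, even with a single sequence
    printf("prefill, 128-token prompt, batch 1: %d paired calls\n", paired_i8mm_calls(128));
    // generation: rows = number of sequences in the batch (1 token each)
    printf("generation, batch 1: %d paired calls\n", paired_i8mm_calls(1));
    printf("generation, batch 4: %d paired calls\n", paired_i8mm_calls(4));
    return 0;
}
```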


Thank you very much @cyb70289 for taking the time to respond. That makes sense.

May I ask - what is nrc in the context of this micro-kernel? Is it the row count of the tile that this micro-kernel is processing? So, if I understand correctly, the I8MM path is triggered for cases where the row count in the tile is == 2?

cyb70289 (Contributor, Author)


IIUC, this nrc is a constant, either 1 or 2, as set in the updated type_traits_cpu[] in this patch. It indicates the maximum number of rows this kernel can handle in one shot. It's not related to the tensor shape, but it can be reduced to 1 when the tensor is just a vector, even if the kernel can handle 2.

cyb70289 (Contributor, Author)


The framework will feed the kernel the appropriate number of rows (nrc) based on its reported capability and the actual data shape.
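
A rough sketch of that idea, under stated assumptions (the typedef and function names below are invented for illustration and are not ggml's real dispatch interface): the caller hands the kernel up to its reported row capability per call, dropping to a single row when the activation is just a vector or one row is left over.

```
// Sketch only: feeding a vec_dot kernel nrc rows at a time based on its
// reported capability (nrc_max = 2 when i8mm is available, 1 otherwise).
#include <stddef.h>

typedef void (*vec_dot_fn)(int n, float *out, const void *x,
                           const void *y, size_t y_row_bytes, int nrc);

static void mul_mat_rows(int n, int nrows, float *out, const void *x,
                         const void *y, size_t y_row_bytes,
                         vec_dot_fn kernel, int nrc_max) {
    int ir = 0;
    while (ir < nrows) {
        // pair rows when possible, otherwise fall back to the 1-row path
        int nrc = (nrows - ir >= nrc_max) ? nrc_max : 1;
        kernel(n, out + ir, x,
               (const char *)y + (size_t)ir * y_row_bytes, y_row_bytes, nrc);
        ir += nrc;
    }
}
```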


Got it - thanks. So basically SMMLA is in action only when nrc is literally 2. Thanks.


Nor7th commented Aug 13, 2025

@cyb70289 Hi, I'm testing this patch on an N2 machine with a deepseek q4_k model. It seems that sometimes it goes into your optimized branch and sometimes it falls back to the SVE branch - is this normal?

cyb70289 (Contributor, Author)

> @cyb70289 Hi, I'm testing this patch on an N2 machine with a deepseek q4_k model. It seems that sometimes it goes into your optimized branch and sometimes it falls back to the SVE branch - is this normal?

What's the batch size? For single batch, only the prompt prefill stage may enter the optimized path.
